Add support for running vLLM #799

Merged
amaslenn merged 50 commits into main from am/vllm on Feb 12, 2026

Conversation

@amaslenn
Contributor

Summary

Two modes are supported at the moment, both single-node only:

  1. Disaggregated run.
  2. Non-disaggregated run.

Test Plan

  1. CI (extended)
  2. Manual runs.

Additional Notes

amaslenn and others added 3 commits February 10, 2026 14:04
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…eval_strategy.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@doc/workloads/vllm.rst`:
- Around line 92-93: The disaggregated TOML snippet is ambiguous because
[extra_env_vars] is shown at top level without context; update the example in
vllm.rst to show a complete TOML block including the top-level keys (e.g., name,
test_template_name, executor/test settings) and explicitly show whether
[extra_env_vars] is a sibling of [cmd_args] or nested under it (for example,
include [cmd_args] with its keys then a separate [extra_env_vars] section), so
readers can unambiguously see the intended section hierarchy and placement of
CUDA_VISIBLE_DEVICES.
- Line 73: Replace the awkward phrase "from less priority to more priority" in
the sentence "The number of GPUs can be controlled using the options below,
listed from less priority to more priority:" with a clearer alternative such as
"from lowest to highest priority" or "in order of increasing priority" so the
sentence reads e.g. "The number of GPUs can be controlled using the options
below, listed from lowest to highest priority:"; update the string where that
sentence appears in the vllm.rst documentation.

In `@src/cloudai/workloads/vllm/report_generation_strategy.py`:
- Around line 47-48: The use of functools.cache on parse_vllm_bench_output
causes indefinite memoization by Path and can return stale results if the file
changes; update the function to either remove the `@cache` decorator or change the
cache key to include the file's modification state (e.g., use an explicit
memoization keyed by (res_file, res_file.stat().st_mtime) or a TTL/lru cache) so
cached entries are invalidated when the file is updated; locate
parse_vllm_bench_output and replace the `@cache` usage with one of these
strategies to ensure fresh results for changed files.
- Around line 53-58: The except clause in the block that opens res_file and
calls VLLMBenchReport.model_validate(data) is redundant because
json.JSONDecodeError is already an Exception; update the except clause from
"except (json.JSONDecodeError, Exception):" to a single "except Exception:" so
it no longer lists duplicate exception types while preserving the current error
handling behavior.
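
A minimal sketch of both points above, assuming only the names the review mentions (`parse_vllm_bench_output`, `VLLMBenchReport.model_validate`) and borrowing the field names from a later review comment; the mtime-keyed dict stands in for `@cache` so a rewritten file is re-parsed, and the `except` clause collapses to a single `Exception`:

```python
import json
from pathlib import Path
from typing import Optional

from pydantic import BaseModel


class VLLMBenchReport(BaseModel):
    """Field names taken from the review comments; the real schema in the PR may differ."""

    mean_ttft_ms: float
    median_ttft_ms: float
    p99_ttft_ms: float
    std_ttft_ms: float
    mean_tpot_ms: float
    median_tpot_ms: float
    p99_tpot_ms: float
    std_tpot_ms: float


# Memoize by (resolved path, mtime): a rewritten file gets a new key, so stale
# entries are never served, unlike functools.cache keyed on the Path alone.
_parse_cache: dict[tuple[Path, float], Optional[VLLMBenchReport]] = {}


def parse_vllm_bench_output(res_file: Path) -> Optional[VLLMBenchReport]:
    if not res_file.is_file():
        return None
    key = (res_file.resolve(), res_file.stat().st_mtime)
    if key not in _parse_cache:
        try:
            with res_file.open() as f:
                data = json.load(f)
            report: Optional[VLLMBenchReport] = VLLMBenchReport.model_validate(data)
        except Exception:  # single clause; json.JSONDecodeError is already an Exception subclass
            report = None
        _parse_cache[key] = report
    return _parse_cache[key]
```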

In `@src/cloudai/workloads/vllm/slurm_command_gen_strategy.py`:
- Around line 258-268: The script launches the proxy in background (proxy_cmd,
PROXY_PID) and immediately starts the benchmark (bench_cmd), causing potential
failures if the proxy isn't ready; update the generated shell to wait for proxy
readiness by invoking the existing wait_for_health helper (or a short sleep)
against the proxy endpoint after starting the proxy and before running
bench_cmd, ensuring the health check references the same proxy port/URL used by
proxy_cmd and still retains PROXY_PID handling.
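
A sketch of the readiness gate for the generated script; `wait_for_health`, the port, and the `/health` path are assumptions based on this comment and the sequence diagram below, not the PR's actual helper names:

```python
def proxy_and_bench_lines(proxy_cmd: str, bench_cmd: str, port: int = 8000) -> list[str]:
    """Emit the proxy + benchmark section of the generated shell script."""
    return [
        f"{proxy_cmd} &",
        "PROXY_PID=$!",
        # Block until the proxy answers on the same port the benchmark targets;
        # a short sleep would be a cruder fallback if the proxy exposes no /health route.
        f"wait_for_health http://0.0.0.0:{port}/health",
        bench_cmd,
        # Existing PROXY_PID handling is retained: stop the proxy once the benchmark finishes.
        "kill $PROXY_PID",
    ]
```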

In `@tests/slurm_command_gen_strategy/test_vllm_slurm_command_gen_strategy.py`:
- Around line 55-60: The fixture vllm_disagg_tr mutates the shared vllm fixture;
instead create a fresh VllmTestDefinition instance (or deep copy the existing
vllm) inside vllm_disagg_tr, set its extra_env_vars =
{"CUDA_VISIBLE_DEVICES":"0,1,2,3"} and its cmd_args.prefill = VllmArgs() on that
new instance, then pass the new instance to TestRun(test=...) so vllm remains
unchanged; reference the vllm_disagg_tr fixture, VllmTestDefinition (or use
copy.deepcopy(vllm)), TestRun, and VllmArgs when making the change.
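
A sketch of the suggested fixture, assuming the test module's existing imports (`VllmTestDefinition`, `VllmArgs`, `TestRun`); `TestRun` is shown with only the `test` argument, while the real fixture would keep whatever other arguments it passes today:

```python
import copy

import pytest


@pytest.fixture
def vllm_disagg_tr(vllm: VllmTestDefinition) -> TestRun:
    # Deep-copy the shared fixture so these mutations never leak into other tests.
    disagg = copy.deepcopy(vllm)
    disagg.extra_env_vars = {"CUDA_VISIBLE_DEVICES": "0,1,2,3"}
    disagg.cmd_args.prefill = VllmArgs()
    return TestRun(test=disagg)
```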

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@amaslenn amaslenn requested a review from podkidyshev February 10, 2026 13:35
@greptile-apps
Contributor

greptile-apps bot commented Feb 10, 2026

Greptile Overview

Greptile Summary

This PR adds comprehensive vLLM support to CloudAI with both aggregated (single instance) and disaggregated (separate prefill/decode) modes for single-node execution.

Key Changes:

  • Implemented VllmTestDefinition with flexible GPU assignment via CUDA_VISIBLE_DEVICES or explicit gpu_ids
  • Created command generation strategy handling disaggregated mode with prefill/decode instances, proxy coordination, and health checks
  • Added report generation parsing JSON benchmark results for TTFT and TPOT metrics
  • Comprehensive test coverage for GPU detection, command generation, and both execution modes
  • Documentation with clear examples for configuration

Critical Issues:

  • Benchmark command formatting bug in slurm_command_gen_strategy.py:116-126 embeds arguments as --model {value} strings instead of separate list items, which will cause shell execution failures
  • Extra args formatting on line 111 creates single-string args instead of separate list elements
  • Log parsing on vllm.py:130 uses fragile split()[2] without validation

Architecture:
The disaggregated mode launches prefill and decode vLLM instances on different GPU sets, uses NixlConnector for KV cache transfer, coordinates via a proxy server, and waits for health checks before running benchmarks.

Confidence Score: 2/5

  • This PR has critical command formatting bugs that will cause benchmark execution failures in production
  • The benchmark command construction embeds arguments within strings (--model {value}) instead of as separate list elements, which will break when the command is executed by the shell. This affects core functionality and will prevent benchmarks from running. While the overall architecture is sound and tests are comprehensive, these execution bugs are blocking issues.
  • Pay close attention to src/cloudai/workloads/vllm/slurm_command_gen_strategy.py - the command formatting bugs must be fixed before merge

Important Files Changed

Filename | Overview
src/cloudai/workloads/vllm/vllm.py | Core vLLM implementation with argument models and success detection logic. Has a fragile log parsing issue that could fail with format variations.
src/cloudai/workloads/vllm/slurm_command_gen_strategy.py | Command generation for both aggregated and disaggregated modes. Critical bug in benchmark command formatting where arguments are embedded in strings instead of being separate list items.
src/cloudai/workloads/vllm/report_generation_strategy.py | Report generation from JSON output with clean error handling and clear metrics display.
tests/slurm_command_gen_strategy/test_vllm_slurm_command_gen_strategy.py | Comprehensive test coverage for GPU detection, command generation, and both aggregated/disaggregated modes.

Sequence Diagram

sequenceDiagram
    participant User
    participant CloudAI
    participant Slurm
    participant Container
    participant vLLM_Prefill
    participant vLLM_Decode
    participant Proxy
    participant Benchmark

    User->>CloudAI: Submit vLLM test (disaggregated mode)
    CloudAI->>Slurm: Generate sbatch script
    Slurm->>Container: Start prefill instance with CUDA_VISIBLE_DEVICES
    Container->>vLLM_Prefill: vllm serve --kv-transfer-config (producer)
    Slurm->>Container: Start decode instance with CUDA_VISIBLE_DEVICES
    Container->>vLLM_Decode: vllm serve --kv-transfer-config (consumer)
    
    CloudAI->>vLLM_Prefill: Health check /health endpoint
    vLLM_Prefill-->>CloudAI: Ready
    CloudAI->>vLLM_Decode: Health check /health endpoint
    vLLM_Decode-->>CloudAI: Ready
    
    Slurm->>Container: Start proxy server
    Container->>Proxy: python3 toy_proxy_server.py
    Proxy->>vLLM_Prefill: Connect to prefill port
    Proxy->>vLLM_Decode: Connect to decode port
    
    Slurm->>Container: Run benchmark
    Container->>Benchmark: vllm bench serve
    Benchmark->>Proxy: Send requests to port 8000
    Proxy->>vLLM_Prefill: Forward prefill requests
    vLLM_Prefill->>vLLM_Decode: Transfer KV cache via NixlConnector
    vLLM_Decode->>Proxy: Return generated tokens
    Proxy->>Benchmark: Return responses
    Benchmark->>Container: Write results to vllm-bench.json
    Container-->>CloudAI: Job complete
    CloudAI->>User: Generate report with metrics

Contributor

@greptile-apps greptile-apps bot left a comment


14 files reviewed, 2 comments


@greptile-apps
Contributor

greptile-apps bot commented Feb 10, 2026

Additional Comments (2)

src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
Incorrect argv tokenization

get_vllm_bench_command() is returning items like "--model {cmd_args.model}" and extras like "--extra 1" as single list elements (later " ".join(...)). When this gets executed, flags/values won’t be passed as distinct argv tokens and any value containing spaces (or needing quoting) will be mis-parsed. This will break benchmark invocation for legitimate inputs.

Consider returning [..., "--model", cmd_args.model, "--base-url", f"http://0.0.0.0:{cmd_args.port}", ...] and for extras extras.extend([f"--{k}", str(v)]) so the script can safely join/execute without re-tokenization.
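
A sketch of that tokenization, taking `cmd_args.model`, `cmd_args.port`, and the extras dict from the comment above; everything else is illustrative:

```python
def get_vllm_bench_command(cmd_args, extra_args: dict[str, str]) -> list[str]:
    cmd = [
        "vllm", "bench", "serve",
        "--model", str(cmd_args.model),
        "--base-url", f"http://0.0.0.0:{cmd_args.port}",
    ]
    for key, value in extra_args.items():
        cmd.extend([f"--{key}", str(value)])  # two argv tokens, never one "--key value" string
    return cmd
```

Joining this list with spaces is then only safe when no value contains whitespace; passing the list through unjoined, or shlex.quote-ing each token before joining, avoids re-tokenization entirely.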


src/cloudai/workloads/vllm/vllm.py
Brittle success parsing

was_run_successful() parses the successful request count with int(line.split()[2]). If vLLM’s output format changes slightly (extra columns, different spacing, etc.), this will throw and you’ll fall through to reporting failure even when results are present (the exception is swallowed and the loop continues). A small regex like r"Successful requests:\s*(\d+)" (or split from the colon) would make this robust and avoid false negatives.
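
A sketch of the regex-based alternative; `was_run_successful` is the method named above, the helper itself is illustrative:

```python
import re

_SUCCESSFUL_RE = re.compile(r"Successful requests:\s*(\d+)")


def successful_requests(line: str) -> int | None:
    """Return the successful-request count if the line carries it, else None."""
    match = _SUCCESSFUL_RE.search(line)
    return int(match.group(1)) if match else None
```

was_run_successful() can then treat any non-None, positive count as success without risking an IndexError or ValueError on extra columns or unusual spacing.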

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/vllm/report_generation_strategy.py`:
- Around line 31-45: VLLMBenchReport defines std_ttft_ms and std_tpot_ms but
they aren't shown in the generated report table; either remove these fields or
add them to the displayed metrics—update the report-generation code that
currently renders mean_ttft_ms/median_ttft_ms/p99_ttft_ms and
mean_tpot_ms/median_tpot_ms/p99_tpot_ms to also include std_ttft_ms and
std_tpot_ms (add headers, column values and formatting consistent with the other
stats), or delete std_ttft_ms/std_tpot_ms from VLLMBenchReport if intentionally
unused; ensure any serialization/deserialization and tests reference the updated
schema.
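
If the fields are kept, one way the extra columns could be rendered; this is purely illustrative, since the PR's actual report layout isn't shown here and only the `VLLMBenchReport` field names above are taken from the review:

```python
def format_metrics_table(r: "VLLMBenchReport") -> str:
    rows = [
        ("TTFT (ms)", r.mean_ttft_ms, r.median_ttft_ms, r.p99_ttft_ms, r.std_ttft_ms),
        ("TPOT (ms)", r.mean_tpot_ms, r.median_tpot_ms, r.p99_tpot_ms, r.std_tpot_ms),
    ]
    header = f"{'Metric':<10} {'mean':>12} {'median':>12} {'p99':>12} {'std':>12}"
    lines = [header]
    for name, *values in rows:
        lines.append(f"{name:<10} " + " ".join(f"{v:>12.2f}" for v in values))
    return "\n".join(lines)
```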

Contributor

@greptile-apps greptile-apps bot left a comment


14 files reviewed, no comments


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/vllm/report_generation_strategy.py`:
- Around line 64-65: The cache key issue comes from passing potentially
non-normalized Path objects into parse_vllm_bench_output from
can_handle_directory; update can_handle_directory to resolve the path (e.g.,
call self.test_run.output_path.resolve() or resolve() on the
VLLM_BENCH_JSON_FILE path) before passing it to parse_vllm_bench_output so the
cached key is consistent with generate_report and other callers, and likewise
ensure any other call sites (like generate_report) also resolve the path before
invoking parse_vllm_bench_output to avoid inconsistent cache hits/misses.
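
A tiny sketch of the call-site fix, assuming the `VLLM_BENCH_JSON_FILE` constant and `test_run.output_path` attribute referenced above; resolving here (or once inside `parse_vllm_bench_output` itself) keeps the cache key identical across callers:

```python
def can_handle_directory(self) -> bool:
    res_file = (self.test_run.output_path / VLLM_BENCH_JSON_FILE).resolve()
    return parse_vllm_bench_output(res_file) is not None
```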

Contributor

@greptile-apps greptile-apps bot left a comment


14 files reviewed, 1 comment


Contributor

@greptile-apps greptile-apps bot left a comment


14 files reviewed, 3 comments


Contributor

@greptile-apps greptile-apps bot left a comment


15 files reviewed, 6 comments


Contributor

@greptile-apps greptile-apps bot left a comment


15 files reviewed, 1 comment


Co-authored-by: Ivan Podkidyshev <raashicat@gmail.com>
Contributor

@greptile-apps greptile-apps bot left a comment


15 files reviewed, 3 comments


@amaslenn amaslenn merged commit 0e23faa into main Feb 12, 2026
4 checks passed
@amaslenn amaslenn deleted the am/vllm branch February 12, 2026 16:54